feat: local review app, LoC+Zucker ingest, corpus audit (198 entries) by shaypal5 · Pull Request #37 · HeOCR/hash

shaypal5 · 2026-05-25T10:25:53Z

Summary

This PR captures a full session of data ingestion, corpus curation, and tooling work. It ingests two new sources, brings the corpus to 198 verified entries (net, after two audit passes), and adds a complete local review application.

Data

New sources ingested

| Source | Items reviewed | Accepted | Notes |
| --- | --- | --- | --- |
| LoC Hebraic Manuscripts | 166 items / 722 pages | 27 entries (16 items) | PDM-1.0; paginated JSON API |
| OPenn Zucker Ketubah Collection | 288 items | 1 entry | CC-BY-SA 4.0; Hebrew-text panel only; other 287 rejected (Aramaic/non-Hebrew-script) |

Corpus audit passes

Two post-ingest audit passes over all existing entries removed out-of-scope material:

Pass 1: 164 entries removed (medieval Geniza fragments, printed pages, non-Hebrew script, misclassified items)
Pass 2: 11 entries removed (additional Geniza fragments + over-accepted LoC pages)

Net corpus after all removals: 198 entries, 57 active sources, 228 files, ~382 MiB.

Scan files

Adds scan directories for the 17 accepted sources only (16 LoC + 1 Zucker, ~25 MiB). Rejected/unreviewed scan directories from the same download sessions remain untracked.

PDF → JPEG thumbnails

Five corpus entries that had only a PDF file (no renderable image) were fixed by running `pdftoppm -jpeg -r 200` to produce a `_thumb.jpg` per entry and adding it as `role: thumbnail` in the files list.
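A minimal sketch of that thumbnail step, for illustration only. The helper names, the `files`/`role`/`path` record layout, and the extra `-f 1 -l 1 -singlefile` flags (first page only, no page-number suffix) are assumptions, not taken from the scripts in this PR; only `pdftoppm -jpeg -r 200` and the `role: thumbnail` convention come from the description.

```python
from pathlib import Path


def build_thumb_command(pdf_path: Path, dpi: int = 200) -> list:
    """Return a pdftoppm argv that renders page 1 to <stem>_thumb.jpg."""
    out_root = pdf_path.with_name(pdf_path.stem + "_thumb")
    # -singlefile makes pdftoppm write exactly <out_root>.jpg.
    return [
        "pdftoppm", "-jpeg", "-r", str(dpi),
        "-f", "1", "-l", "1", "-singlefile",
        str(pdf_path), str(out_root),
    ]


def prepend_thumbnail(entry: dict, thumb_path: str) -> dict:
    """Return a copy of entry with the thumbnail first in its files list."""
    # Drop any earlier thumbnail so re-running the step stays idempotent.
    files = [f for f in entry.get("files", []) if f.get("role") != "thumbnail"]
    return {**entry, "files": [{"role": "thumbnail", "path": thumb_path}, *files]}
```

To actually render a file, the command could be passed to `subprocess.run(build_thumb_command(pdf), check=True)`, which requires Poppler's `pdftoppm` on the `PATH`.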

Tooling

Ingest scripts

scripts/ingest_loc.py: paginates the LoC JSON API, filters out pre-1700 and printed items, downloads up to 5 pages per item, and writes data/review/loc_pending.jsonl
scripts/ingest_zucker.py: parses OPenn TEI manifests for the Zucker collection and writes data/review/zucker_pending.jsonl
scripts/merge_review.py: promotes approved decisions into entries.jsonl + sources.jsonl; auto-creates per-item source records so the entry-ID → source-ID constraint is always satisfied
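The promotion step can be sketched roughly as follows. This is not the code in `scripts/merge_review.py`: the function name, the `id`/`source_id`/`status` fields, and the decisions format (entry ID mapped to `"approved"` or `"rejected"`) are all assumptions made for illustration.

```python
def merge_approved(pending, decisions, entries, sources):
    """Promote approved pending records and create any missing source records.

    Returns new (entries, sources) lists; the inputs are not mutated.
    """
    known_entries = {e["id"] for e in entries}
    known_sources = {s["id"] for s in sources}
    out_entries, out_sources = list(entries), list(sources)

    for rec in pending:
        # Skip anything not approved, and anything already in the index.
        if decisions.get(rec["id"]) != "approved" or rec["id"] in known_entries:
            continue
        # Auto-create the per-item source so every entry points at a real source.
        if rec["source_id"] not in known_sources:
            out_sources.append({"id": rec["source_id"], "status": "active"})
            known_sources.add(rec["source_id"])
        out_entries.append(rec)
        known_entries.add(rec["id"])

    return out_entries, out_sources
```

Creating the source record before appending the entry means the entry-ID → source-ID constraint holds at every point in the loop, not only at the end.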

Local review app (`scripts/review_app/`)

A Flask app (port 5757) for human review of pending batches and the verified corpus. Run with `pip install flask && python scripts/review_app/app.py`.

Home page, with a three-way view toggle:

By Writer — one card per author (name, death year, entry count, date range, sample thumbnail)
By Source — one card per source (title, provider, entry count, licenses, sample thumbnail)
All Entries — flat scrollable grid of all 198 entries

Corpus stats dashboard (top of home page):

Key metrics: entry count, source count, writer count, year span, transcript count
License breakdown bar: colour-coded segments with legend (counts + %)
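The counts and percentages behind the license bar could be computed along these lines. This is a hedged sketch, not the `compute_corpus_stats()` helper from `app.py`: the function name, the `license` field, and the `"unclear"` fallback label are assumptions.

```python
from collections import Counter


def license_breakdown(entries):
    """Return one segment per license, largest first, with count and percent."""
    # Entries with a missing or empty license are grouped as "unclear".
    counts = Counter(e.get("license") or "unclear" for e in entries)
    total = sum(counts.values())
    return [
        {"license": lic, "count": n, "pct": round(100 * n / total, 1)}
        for lic, n in counts.most_common()
    ]
```

An empty corpus yields an empty list, so the template can simply skip the bar when there are no segments.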

Per-writer / per-source detail pages:

Entry cards with rights badge (✓ green / ⚠ yellow), transcript badge, zoomable lightbox
Source name shown under each card in writer view; creator names in source view

Corpus audit page (/audit):

All corpus entries in a grid; flag for removal + comment per entry
Filters: All / Flagged / ⚠ License unclear / With notes
Saves to data/review/audit_decisions.json
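Because several views save to the same decisions file, a save should merge with what is already on disk rather than overwrite it. The sketch below illustrates that idea under assumptions: the signature, the `live_ids` parameter, and the decision payload shape are hypothetical, and the stale-ID filter is folded into the save here for brevity.

```python
import json
from pathlib import Path


def save_decisions(path: Path, incoming: dict, live_ids: set) -> dict:
    """Merge incoming decisions into the file at path and return the result."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Merge, don't clobber: incoming wins per entry ID, other IDs are kept.
    existing.update(incoming)
    # Drop decisions for entries that are no longer in the corpus index.
    merged = {k: v for k, v in existing.items() if k in live_ids}
    path.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    return merged
```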

Global ✎ Actions toggle (nav header, all pages):

Off by default — clean browse mode
Reveals flag button + comment textarea on every entry card in every view
State persists in localStorage; same /api/audit/decide endpoint used everywhere

Batch review UI (/review/<batch_id>):

Inverted-accept pattern: all entries dim by default, click to accept
Progress bar, approved/rejected counts

Documentation

AGENTS.md: tightened corpus scope — 18th century minimum, cursive כתב יד only, Yiddish in Hebrew script in scope, Judeo-Arabic out of scope
docs/sources/wikimedia_queue.md: updated Wikimedia queue log
README.md, exports, NOTICE.md, CITATION.cff, datapackage.json: regenerated from current index

🤖 Generated with Claude Code

…er/source views

- Removed 164 entries flagged during corpus audit (209 remain)
- Marked 20 orphaned source records as rejected
- Review app: redesign home with clickable By Writer / By Source card grids
- Review app: new /writer/<slug> and /source/<id> per-group entry views
- Review app: show transcript status badge per entry (status + license if present)
- Review app: audit page now shows transcript info per entry
- Updated exports and README status (209 entries, 57 sources with entries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- License breakdown bar: segmented colour bar + legend with counts and %
- Key metric blocks: entries, sources, writers, date range, transcript count
- Warn block shown if any entries have unclear rights
- compute_corpus_stats() helper in app.py; license short-names + colour map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

5 entries stored only a PDF file, which browsers can't display inline as an <img>. Used pdftoppm at 200 DPI to produce a JPEG thumbnail for each, added as role=thumbnail prepended in the files list so the review app picks it up immediately. Original PDF kept as role=original.

Affected entries:
- commons__auerbach_letter_shtenzel_1961__p0001
- commons__bendin_semichah_shtenzel_1933__p0001
- commons__weidenfeld_eruv_letter_1947__p0001
- commons__wosner_halachic_ruling_1981__p0001
- commons__wosner_support_letter_1990__p0001

Validation: 111 sources, 209 entries, 242 files verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Actions toggle (✎ Actions button in nav header):
- Hidden by default; one click reveals flag button + comment textarea on every entry card across all views (home, group pages, audit)
- State persists in localStorage; CSS-driven via body.show-actions class
- Audit submit button also gated behind the toggle

All Entries view (third tab on home page):
- Flat scrollable grid of all 209 entries, same card style as group pages
- Includes rights/transcript badges, lightbox zoom, action strips
- Browse save bar (sticky bottom) appears when Actions are on; saves to the same /api/audit/decide endpoint and merges with existing decisions

Group pages (writer/source):
- Flag/comment action strips added to each entry card (hidden by default)
- Floating browse save bar; loads + merges with existing audit decisions

New API endpoint GET /api/audit/decisions for client-side merge before save

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Removed entries (all flagged via audit UI 2026-05-25):
- commons__bodleian_geniza_ms_heb_d_41_4b__p0001 (Geniza fragment)
- commons__bodleian_geniza_ms_heb_e_39_78b__p0001 (Geniza fragment)
- commons__chief_rabbinate_letter_1921__p0001
- commons__chushiel_letter_geniza__p0001 (Geniza)
- commons__damascus_pentateuch_ms_heb_8_7088__p0001
- commons__geniza_education_ts_k5_13__p0001 (Geniza)
- commons__grodzinski_letter_about_kook__p0001
- commons__halper462_exilarch_genealogy__p0001
- loc__2024422570__p0003, p0004, p0005

9 now-orphaned source records marked rejected. Corpus: 198 entries across 111 sources (228 files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- scripts/ingest_loc.py: paginate LoC Hebraic Manuscripts JSON API, filter pre-1700/printed items, download up to 5 pages per item, write data/review/loc_pending.jsonl
- scripts/ingest_zucker.py: parse OPenn TEI manifests for the Zucker Ketubah Collection, write data/review/zucker_pending.jsonl
- scripts/merge_review.py: promote approved review decisions into entries.jsonl + sources.jsonl; auto-creates per-item source records
- scripts/review_app/requirements.txt: Flask dependency for review app
- scripts/review_app/templates/batch.html: batch review UI (invert accept pattern: all dim by default, click to accept)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- data/review/loc_pending.jsonl: 722 entries staged from LoC Hebraic Manuscripts collection; 166 items, up to 5 pages each
- data/review/loc_decisions.json: 722 decisions (27 approved, 695 rejected)
- data/review/zucker_pending.jsonl: 288 entries staged from OPenn Zucker Ketubah Collection
- data/review/zucker_decisions.json: 288 decisions (1 approved, 287 rejected)

These files serve as the audit trail for the two review sessions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds only the scan directories for sources whose entries are in the corpus index (entries.jsonl). Rejected/unreviewed scan directories from the same download sessions remain untracked.

Sources included:
- 16 LoC Hebraic Manuscripts items (loc__2018757642 … loc__2023530858) accepted from the 166-item LoC review session (27 entries total)
- openn__zucker__ket_z_238: single accepted Zucker ketubah (Hebrew-text panel, CC-BY-SA 4.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eview

- app.py full rewrite:
  - extract _enrich_entries() helper (eliminates 4 copy-pasted enrichment blocks)
  - add mtime-keyed module-level file cache
  - add load_audit_decisions(live_ids) to filter stale decisions at load time
  - fix path traversal in serve_scan() with resolve() + relative_to() check
  - fix save_decisions() to merge-not-clobber via existing.update(incoming)
  - fix source_detail() 404 axis (check source_id not in sources, not len(entries)==0)
  - fix review_batch() hardcoded source_id with primary_sid=max(set(...),key=count)
  - fix walrus-operator double-call in group thumb helpers
  - remove dead imports (re, sys, datetime, timezone)
- templates: slim full-entry JSON blobs to ID-only arrays (ENTRIES→ENTRY_IDS, ALL_ENTRIES→ALL_ENTRY_IDS); this eliminates ~588 KB of tojson payload per page load; update save loops to iterate IDs
- group.html: remove dead fetch('/api/audit/status') try block in saveDecisions() that preceded the real merge fetch
- data/review/audit_decisions.json: clear 175 stale decisions (all referenced entries removed from corpus in audit passes)
- merge_review.py: prune stale IDs from audit_decisions.json after each batch merge so the file stays in sync with the live index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
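The resolve() + relative_to() check mentioned for serve_scan() follows a common pattern for rejecting path traversal. A minimal sketch, assuming a hypothetical helper name and signature rather than the actual code in app.py:

```python
from pathlib import Path
from typing import Optional


def safe_scan_path(scans_root: Path, requested: str) -> Optional[Path]:
    """Return the resolved path if it stays inside scans_root, else None."""
    root = scans_root.resolve()
    # resolve() collapses ".." segments and symlinks before the containment check.
    candidate = (root / requested).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        # The request escaped the scans directory (e.g. "../../etc/passwd").
        return None
    return candidate
```

A route handler would typically respond with a 404 when this returns None, and serve the file otherwise.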

shaypal5 and others added 9 commits May 25, 2026 11:55

shaypal5 merged commit 7178d47 into main May 25, 2026
1 check failed

shaypal5 deleted the feat/review-app-and-corpus-audit branch May 25, 2026 12:20
